+++
title = 'Reinforcement learning'
template = 'page-math.html'
+++
# Reinforcement learning

## What is reinforcement learning?
The agent is in a state and takes an action.
The action is selected by a policy - a function from states to actions.
The environment tells the agent its new state, and provides a reward (a number, higher is better).
The learner adapts the policy to maximise the expectation of future rewards.

Markov decision process: the optimal policy does not need to depend on previous states; only the information in the current state counts.

![90955f3da8fb0d61c2fa9f3033c65098.png](e78427ef0d0845d0ae21e1c7857d2740.png)

Dealing with sparse loss:
- start with imitation learning - supervised learning, copying human actions
- reward shaping - guessing a reward for intermediate states, or for states close to good states
- auxiliary goals - curiosity, maximum distance traveled

policy network: a NN that takes the state as input and has a softmax output layer, producing a probability distribution over actions.

three problems of RL:
- non-differentiable loss
- balancing exploration and exploitation
  - this is a classic trade-off in online learning
  - for example, an agent in a maze may train to reach a reward of 1 that's close by and exploit that reward, and so it might never explore further and reach the 100 reward at the end of the maze
- delayed reward/sparse loss
  - you might take an action that causes a negative result, but the result won't show up until some time later
  - for example, starting to study before an exam is a good thing. the issue is that you only started one day before, and didn't do jack shit during the preceding two weeks.
  - credit assignment problem: how do you know which action is to blame for the bad result?

deterministic policy - every state is followed by the same action.
probabilistic policy - all actions are possible, but certain actions have a higher probability.

## Approaches
how do you choose the weights (how do you learn)?
simple backpropagation doesn't work - we don't have labeled examples to tell us which move to take for a given state.

### Random search
pick a random point m in model space.

```
loop:
    pick a random point m' close to m
    if loss(m') < loss(m):
        m <- m'
```

"close to" means m' is sampled uniformly among all points at some pre-chosen distance r from m.

### Policy gradient
follow some semi-random policy, wait until a reward state is reached, then label all previous state-action pairs with the final outcome.
i.e. if some actions were bad, they will on average occur more often in sequences that end with a negative reward, and so on average they will be labeled as bad more often.

![442f7f9bc5e14ffbbcfd54f6ea6b72df.png](c484829362004f90be2b33a92acf7fd9.png)

$\nabla \mathbb{E}_a[r(a)] = \nabla \sum_{a} p(a)\, r(a) = \sum_a r(a)\, \nabla p(a) = \mathbb{E}_{a}[r(a) \nabla \ln p(a)]$ (using $\nabla p = p \nabla \ln p$), where $r(a)$ is the ultimate reward at the end of the trajectory.
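To make the identity above concrete, here is a minimal REINFORCE-style sketch (numpy only) on a made-up one-state problem, a 3-armed bandit. The reward values, learning rate, and number of steps are arbitrary choices for illustration, not from the slides; the key line is the update $\theta \mathrel{+}= \eta\, r\, \nabla_\theta \ln p(a)$, where for a softmax policy $\nabla_\theta \ln p(a)$ is the one-hot vector for $a$ minus the probability vector.

```
import numpy as np

rng = np.random.default_rng(0)
true_rewards = np.array([1.0, 0.0, 5.0])  # hypothetical expected reward per action
theta = np.zeros(3)                       # policy parameters: one logit per action
lr = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(2000):
    p = softmax(theta)
    a = rng.choice(3, p=p)                # sample an action from the current policy
    r = true_rewards[a] + rng.normal()    # noisy reward from the "environment"
    grad_log_p = np.eye(3)[a] - p         # gradient of ln p(a) w.r.t. the logits
    theta += lr * r * grad_log_p          # REINFORCE update: reward times grad of log-prob

print(softmax(theta))  # most of the probability mass should end up on action 2
```

In a full RL setting the bandit becomes a whole trajectory: $r$ is the final outcome of the episode and the same update is applied to every state-action pair along the way, which is exactly the "label all previous state-action pairs with the final outcome" idea above.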
### Q-learning
If I need this, I'll make better notes, can't really understand it from the slides.

## Alpha-stuff
### AlphaGo
starts with imitation learning.
improves by playing against previous iterations and against itself; trained by reinforcement learning, using policy gradient descent to update the weights.
during play, it uses Monte Carlo Tree Search (a generic selection sketch is at the end of these notes), with node values being the probability that black will win from that state.

### AlphaZero
learns from scratch: there's no imitation learning or reward shaping.
also applicable to other games, like chess.

Improves on AlphaGo by:
- combining the policy and value nets into a single network
- viewing MCTS as a policy improvement operator
- adding residual connections and batch normalization

### AlphaStar
This shit can play StarCraft.

Real time, imperfect information, a large and diverse action space, and no single best strategy.
Its behaviour is generated by a deep NN that gets its input from the game interface and outputs instructions that form an action in the game.

it has a transformer torso for the units, and a deep LSTM core with an autoregressive policy head and a pointer network.
it makes use of multi-agent learning.
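The AlphaGo and AlphaZero sections lean on Monte Carlo Tree Search, and the tree policy is itself an exploration/exploitation balance (the same trade-off from the "three problems" list). Below is the generic UCT-style child-selection sketch referenced above; AlphaGo and AlphaZero actually use a PUCT-style rule that additionally weights exploration by the policy net's prior for each move, and the `visits`/`value_sum` data layout here is made up for illustration.

```
import math

def uct_select(children, c=1.4):
    """Pick the child balancing exploitation (average value) and exploration (visit counts)."""
    total_visits = sum(ch["visits"] for ch in children)

    def score(ch):
        if ch["visits"] == 0:
            return float("inf")                   # always try unvisited children first
        exploit = ch["value_sum"] / ch["visits"]  # average outcome observed below this child
        explore = c * math.sqrt(math.log(total_visits) / ch["visits"])
        return exploit + explore

    return max(children, key=score)
```

In AlphaGo the node value mixes rollout outcomes with the value net's estimate (AlphaZero drops the rollouts and uses the value net alone), which is what "node values being the probability that black will win from that state" refers to.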